AITopics | Bolivar Department


Bolivar Department

AI Alignment in Medical Imaging: Unveiling Hidden Biases Through Counterfactual Analysis

Ma, Haroui, Quinzan, Francesco, Willem, Theresa, Bauer, Stefan

arXiv.org Machine Learning, Apr-28-2025

Machine learning (ML) systems for medical imaging have demonstrated remarkable diagnostic capabilities, but their susceptibility to biases poses significant risks, since biases may negatively impact generalization performance. In this paper, we introduce a novel statistical framework to evaluate the dependency of medical imaging ML models on sensitive attributes, such as demographics. Our method leverages the concept of counterfactual invariance, measuring the extent to which a model's predictions remain unchanged under hypothetical changes to sensitive attributes. We present a practical algorithm that combines conditional latent diffusion models with statistical hypothesis testing to identify and quantify such biases without requiring direct access to counterfactual data. Through experiments on synthetic datasets and large-scale real-world medical imaging datasets, including CheXpert and MIMIC-CXR, we demonstrate that our approach aligns closely with counterfactual fairness principles and outperforms standard baselines. This work provides a robust tool to ensure that ML diagnostic systems generalize well, e.g., across demographic groups, offering a critical step towards AI safety in healthcare. Code: https://github.com/Neferpitou3871/AI-Alignment-Medical-Imaging.
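As a rough illustration of the testing idea described in this abstract, the sketch below runs a paired sign-flip permutation test on a model's predictions for original inputs and their counterfactuals. It is not the authors' algorithm: the function name `sign_flip_pvalue`, the choice of test statistic, and the data are assumptions, and the counterfactual predictions are simulated with random numbers rather than produced by a conditional latent diffusion model.

```python
import numpy as np

def sign_flip_pvalue(pred_factual, pred_counterfactual, n_perm=2000, seed=0):
    """Paired sign-flip test of H0: predictions do not change with the attribute."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(pred_factual, dtype=float) - np.asarray(pred_counterfactual, dtype=float)
    observed = abs(diff.mean())
    # Under H0 the sign of each paired difference is exchangeable.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diff.size))
    null_stats = np.abs((signs * diff).mean(axis=1))
    # Add-one correction keeps the p-value strictly positive.
    return (1 + np.sum(null_stats >= observed)) / (n_perm + 1)

# Simulated scores (assumption): no real model or images are involved here.
rng = np.random.default_rng(1)
base = rng.uniform(0.2, 0.8, size=200)                  # predictions on original inputs
invariant = base + rng.normal(0, 0.01, size=200)        # counterfactual predictions, no shift
biased = base + 0.05 + rng.normal(0, 0.01, size=200)    # counterfactual predictions, shifted

p_invariant = sign_flip_pvalue(base, invariant)
p_biased = sign_flip_pvalue(base, biased)
print(p_invariant, p_biased)
```

A small p-value for the shifted case suggests the predictions depend on the changed attribute; the unshifted case gives no such evidence beyond chance.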

artificial intelligence, dataset, machine learning, (17 more...)

arXiv.org Machine Learning

2504.19621

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
(12 more...)

Genre: Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Health Care Technology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)


CrossFormer: Cross-Segment Semantic Fusion for Document Segmentation

Ni, Tongke, Fan, Yang, Zhou, Junru, Wu, Xiangping, Chen, Qingcai

arXiv.org Artificial Intelligence, Apr-2-2025

Text semantic segmentation involves partitioning a document into multiple paragraphs with continuous semantics based on the subject matter, contextual information, and document structure. Traditional approaches have typically relied on preprocessing documents into segments to address input length constraints, resulting in the loss of critical semantic information across segments. To address this, we present CrossFormer, a transformer-based model featuring a novel cross-segment fusion module that dynamically models latent semantic dependencies across document segments, substantially elevating segmentation accuracy. Additionally, CrossFormer can replace rule-based chunk methods within the Retrieval-Augmented Generation (RAG) system, producing more semantically coherent chunks that enhance its efficacy. Comprehensive evaluations confirm CrossFormer's state-of-the-art performance on public text semantic segmentation datasets, alongside considerable gains on RAG benchmarks.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2503.23671

Country:

Asia > China > Guangdong Province > Shenzhen (0.04)
South America > Colombia > Bolivar Department > Cartagena (0.04)
North America > United States > New Mexico > Doña Ana County > Las Cruces (0.04)
(12 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)


NUTSHELL: A Dataset for Abstract Generation from Scientific Talks

Züfle, Maike, Papi, Sara, Savoldi, Beatrice, Gaido, Marco, Bentivogli, Luisa, Niehues, Jan

arXiv.org Artificial Intelligence, Feb-24-2025

Scientific communication is receiving increasing attention in natural language processing, especially to help researchers access, summarize, and generate content. One emerging application in this area is Speech-to-Abstract Generation (SAG), which aims to automatically generate abstracts from recorded scientific presentations. SAG enables researchers to efficiently engage with conference talks, but progress has been limited by a lack of large-scale datasets. To address this gap, we introduce NUTSHELL, a novel multimodal dataset of *ACL conference talks paired with their corresponding abstracts. We establish strong baselines for SAG and evaluate the quality of generated abstracts using both automatic metrics and human judgments. Our results highlight the challenges of SAG and demonstrate the benefits of training on NUTSHELL. By releasing NUTSHELL under an open license (CC-BY 4.0), we aim to advance research in SAG and foster the development of improved models and evaluation methods.

computational linguistic, dataset, evaluation, (16 more...)

arXiv.org Artificial Intelligence

2502.16942

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > Ontario > Toronto (0.04)
Asia > Thailand > Bangkok > Bangkok (0.04)
(11 more...)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.73)


Enhanced MRI Representation via Cross-series Masking

Wang, Churan, Gao, Fei, Yan, Lijun, Wang, Siwen, Yu, Yizhou, Wang, Yizhou

arXiv.org Artificial Intelligence, Dec-10-2024

Magnetic resonance imaging (MRI) is indispensable for diagnosing and planning treatment in various medical conditions due to its ability to produce multi-series images that reveal different tissue characteristics. However, integrating these diverse series into a coherent analysis presents significant challenges, such as differing spatial resolutions and contrast patterns, while also requiring extensive annotated data, which is scarce in clinical practice. To address these issues, we introduce a novel Cross-Series Masking (CSM) Strategy for effectively learning MRI representation in a self-supervised manner. Specifically, CSM begins by randomly sampling a subset of regions and series, which are then masked. During training, the cross-series representation is learned by using the unmasked data to reconstruct the masked portions. This process not only integrates information across different series but also makes it possible to model both intra-series and inter-series correlations and complementarities. The learned representation also improves downstream tasks such as segmentation and classification. Taking brain tissue segmentation, breast tumor benign/malignant classification, and prostate cancer diagnosis as examples, our method achieves state-of-the-art performance on both public and in-house datasets.
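The masking step described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the function `cross_series_mask`, the patch size, the masking ratios, and the use of zeros for hidden pixels are all hypothetical choices, and the reconstruction network that would be trained on the masked input is omitted.

```python
import numpy as np

def cross_series_mask(volume, patch=4, region_ratio=0.5, series_ratio=0.5, seed=0):
    """Hide random patches in a random subset of series.

    volume: array of shape (num_series, height, width).
    Returns the masked copy and a boolean mask (True = hidden).
    """
    rng = np.random.default_rng(seed)
    num_series, height, width = volume.shape
    mask = np.zeros(volume.shape, dtype=bool)
    n_pick = max(1, int(round(series_ratio * num_series)))
    chosen_series = rng.choice(num_series, size=n_pick, replace=False)
    grid_h, grid_w = height // patch, width // patch
    for s in chosen_series:
        # Choose which patches of this series to hide.
        hide = rng.random((grid_h, grid_w)) < region_ratio
        # Expand the patch grid back to pixel resolution.
        pixels = np.repeat(np.repeat(hide, patch, axis=0), patch, axis=1)
        mask[s, :grid_h * patch, :grid_w * patch] = pixels
    masked = np.where(mask, 0.0, volume)
    return masked, mask

# Toy input (assumption): 4 series of 16x16 random values instead of MRI data.
volume = np.random.default_rng(1).normal(size=(4, 16, 16))
masked, mask = cross_series_mask(volume)
print(mask.any(axis=(1, 2)))  # which series received any masking
```

A reconstruction model would then be trained to predict `volume[mask]` from `masked`, so that series left visible can inform the hidden regions of the others.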

artificial intelligence, machine learning, representation, (16 more...)

arXiv.org Artificial Intelligence

2412.07387

Country:

Europe > Switzerland (0.04)
South America > Colombia > Bolivar Department > Cartagena (0.04)
North America > United States > Virginia > Arlington County > Arlington (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Data Science (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)


GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

Zeng, Aohan, Du, Zhengxiao, Liu, Mingdao, Wang, Kedong, Jiang, Shengmin, Zhao, Lei, Dong, Yuxiao, Tang, Jie

arXiv.org Artificial Intelligence, Dec-3-2024

We introduce GLM-4-Voice, an intelligent and human-like end-to-end spoken chatbot. It supports both Chinese and English, engages in real-time voice conversations, and varies vocal nuances such as emotion, intonation, speech rate, and dialect according to user instructions. GLM-4-Voice uses an ultra-low bitrate (175bps), single-codebook speech tokenizer with 12.5Hz frame rate derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. To efficiently transfer knowledge from text to speech modalities, we synthesize speech-text interleaved data from existing text pre-training corpora using a text-to-token model. We continue pre-training from the pre-trained text language model GLM-4-9B with a combination of unsupervised speech data, interleaved speech-text data, and supervised speech-text data, scaling up to 1 trillion tokens, achieving state-of-the-art performance in both speech language modeling and spoken question answering. We then fine-tune the pre-trained model with high-quality conversational speech data, achieving superior performance compared to existing baselines in both conversational ability and speech quality.
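The two tokenizer figures quoted in this abstract can be related with a short calculation. The sketch assumes one token per frame from a single codebook and that the bitrate counts only token indices; the codebook size it prints is an inference from those assumptions, not a number stated in the abstract.

```python
bitrate_bps = 175.0     # from the abstract
frame_rate_hz = 12.5    # from the abstract

# One token per frame, so bits per token = bits per second / tokens per second.
bits_per_token = bitrate_bps / frame_rate_hz

# A single codebook addressed by that many bits has 2**bits entries (inferred).
implied_codebook_size = 2 ** int(bits_per_token)

print(bits_per_token, implied_codebook_size)
```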

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2412.02612

Country:

Europe > Austria > Vienna (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Europe > Czechia > South Moravian Region > Brno (0.04)
(12 more...)

Genre: Research Report (0.65)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)


Scaling Speech-Text Pre-training with Synthetic Interleaved Data

Zeng, Aohan, Du, Zhengxiao, Liu, Mingdao, Zhang, Lei, Jiang, Shengmin, Dong, Yuxiao, Tang, Jie

arXiv.org Artificial Intelligence, Dec-2-2024

Speech language models (SpeechLMs) accept speech input and produce speech output, allowing for more natural human-computer interaction compared to text-based large language models (LLMs). Traditional approaches for developing SpeechLMs are constrained by the limited availability of unsupervised speech data and parallel speech-text data, which are significantly less abundant than text pre-training data, thereby limiting their scalability as LLMs. We propose a novel approach to scaling speech-text pre-training by leveraging large-scale synthetic interleaved data derived from text corpora, eliminating the need for parallel speech-text datasets. Our method efficiently constructs speech-text interleaved data by sampling text spans from existing text corpora and synthesizing corresponding speech spans using a text-to-token model, bypassing the need to generate actual speech. We also employ a supervised speech tokenizer derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. Starting from a pre-trained language model and scaling our pre-training to 1 trillion tokens (with 600B synthetic interleaved speech-text data), we achieve state-of-the-art performance in speech language modeling and spoken question answering, improving performance on spoken question answering tasks from the previous SOTA of 13% (Moshi) to 31%. We further demonstrate that by fine-tuning the pre-trained model with speech dialogue data, we can develop an end-to-end spoken chatbot that achieves competitive performance comparable to existing baselines in both conversational abilities and speech quality, even operating exclusively in the speech domain.

Figure 1: (Left) The performance on Spoken QA continuously improves as the amount of synthetic interleaved data increases, significantly surpassing the previous SOTA (Moshi).

Work was done when ML, LZ interned at Zhipu.AI.

Large language models (LLMs) have significantly advanced natural language processing, demonstrating capabilities beyond traditional language tasks. Trained on vast internet corpora, they exhibit emergent abilities such as instruction following (Ouyang et al., 2022), logical reasoning (Wei et al., 2022), and tool utilization (Schick et al., 2023). These advancements have enabled applications like interactive chatbots and personalized digital assistants. However, an ideal AI assistant should not rely solely on text. Voice-based interaction offers a more natural and intuitive interface for human-AI interaction. Traditional voice-based systems combine Automatic Speech Recognition (ASR), LLMs, and Text-to-Speech (TTS) models in a cascading manner. This approach, however, suffers from information loss during ASR and TTS processes, limiting the ability to capture and express the rich nuances of speech.
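The interleaved-data construction described in this entry's abstract can be pictured with a toy sketch. Everything here is an assumption made for illustration: `text_to_tokens` is a hypothetical stand-in that invents one fake speech token per word, and the fixed span length and replacement probability are arbitrary rather than the paper's sampling scheme.

```python
import random

def text_to_tokens(span_words):
    """Hypothetical stand-in for a text-to-token model: one fake speech token per word."""
    return [f"<speech_{sum(map(ord, w)) % 1000}>" for w in span_words]

def interleave(words, span_len=3, speech_prob=0.5, seed=0):
    """Split words into spans and replace a random subset with speech tokens."""
    rng = random.Random(seed)
    out = []
    for start in range(0, len(words), span_len):
        span = words[start:start + span_len]
        if rng.random() < speech_prob:
            out.extend(text_to_tokens(span))   # span rendered as speech tokens
        else:
            out.extend(span)                   # span kept as text
    return out

sentence = "voice based interaction offers a more natural interface".split()
sample = interleave(sentence)
print(sample)
```

The point of the sketch is that no audio is synthesized: the speech side of each sequence is produced directly as discrete tokens from text.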

dataset, interleaved data, language model, (15 more...)

arXiv.org Artificial Intelligence

2411.17607

Country:

Europe > Austria > Vienna (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > United States > California > San Francisco County > San Francisco (0.04)
(16 more...)

Genre: Research Report > Promising Solution (0.48)

Industry:

Health & Medicine (0.92)
Leisure & Entertainment > Games (0.46)
Government > Regional Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)


Direct Speech-to-Speech Neural Machine Translation: A Survey

Gupta, Mahendra, Dutta, Maitreyee, Maurya, Chandresh Kumar

arXiv.org Artificial Intelligence, Nov-13-2024

Speech-to-Speech Translation (S2ST) models transform speech from one language to another target language with the same linguistic information. S2ST is important for bridging the communication gap among communities and has diverse applications. In recent years, researchers have introduced direct S2ST models, which have the potential to translate speech without relying on intermediate text generation, reduce decoding latency, and preserve paralinguistic and non-linguistic features. However, direct S2ST has yet to achieve quality performance for seamless communication and still lags behind the cascade models in terms of performance, especially in real-world translation. To the best of our knowledge, no comprehensive survey of direct S2ST systems is available for beginners and advanced researchers to consult as a quick reference. The present work provides a comprehensive review of direct S2ST models, data and application issues, and performance metrics. We critically analyze the models' performance over the benchmark datasets and provide research challenges and future directions.

machine learning, natural language, translation, (18 more...)

arXiv.org Artificial Intelligence

2411.14453

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
North America > Canada > Ontario > Toronto (0.04)
(34 more...)

Genre: Overview (1.00)

Industry:

Education (0.67)
Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)


Optimizing Rare Word Accuracy in Direct Speech Translation with a Retrieval-and-Demonstration Approach

Li, Siqi, Liu, Danni, Niehues, Jan

arXiv.org Artificial Intelligence, Sep-13-2024

Direct speech translation (ST) models often struggle with rare words. Incorrect translation of these words can have severe consequences, impacting translation quality and user trust. While rare word translation is inherently challenging for neural models due to sparse learning signals, real-world scenarios often allow access to translations of past recordings on similar topics. To leverage these valuable resources, we propose a retrieval-and-demonstration approach to enhance rare word translation accuracy in direct ST models. First, we adapt existing ST models to incorporate retrieved examples for rare word translation, which allows the model to benefit from prepended examples, similar to in-context learning. We then develop a cross-modal (speech-to-speech, speech-to-text, text-to-text) retriever to locate suitable examples. We demonstrate that standard ST models can be effectively adapted to leverage examples for rare word translation, improving rare word translation accuracy over the baseline by 17.6% with gold examples and 8.5% with retrieved examples. Moreover, our speech-to-speech retrieval approach outperforms other modalities and exhibits higher robustness to unseen speakers. Our code is publicly available (https://github.com/SiqiLii/Retrieve-and-Demonstration-ST).
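The retrieve-then-prepend idea in this abstract can be sketched in a few lines. This is a simplified illustration, not the released code: the embeddings are hand-written toy vectors, the example pool and the `build_input` format are hypothetical, and text strings stand in for what would be speech inputs in the paper's setting.

```python
import numpy as np

def retrieve(query_vec, example_vecs):
    """Return the index of the pool example with the highest cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    e = example_vecs / np.linalg.norm(example_vecs, axis=1, keepdims=True)
    return int(np.argmax(e @ q))

def build_input(demo_source, demo_translation, new_source):
    """Prepend the retrieved example as a demonstration (format is an assumption)."""
    return f"{demo_source} => {demo_translation}\n{new_source} =>"

# Toy pool (assumption): embeddings and sentence pairs invented for illustration.
pool_vecs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
pool_pairs = [
    ("The weather is nice.", "Das Wetter ist schön."),
    ("The patient has tachycardia.", "Der Patient hat Tachykardie."),
    ("The train is late.", "Der Zug hat Verspätung."),
]

query_vec = np.array([0.1, 0.9, 0.2])   # toy embedding of the new utterance
idx = retrieve(query_vec, pool_vecs)
prompt = build_input(*pool_pairs[idx], "Tachycardia can be dangerous.")
print(prompt)
```

The retrieved pair contains the rare word, so a model adapted to read demonstrations can copy its translation instead of guessing.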

computational linguistic, rare word, translation, (15 more...)

arXiv.org Artificial Intelligence

2409.09009

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > Ontario > Toronto (0.04)
South America > Colombia > Bolivar Department > Cartagena (0.04)
(23 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)


Foundation Models for Music: A Survey

Ma, Yinghao, Øland, Anders, Ragni, Anton, Del Sette, Bleiz MacSen, Saitis, Charalampos, Donahue, Chris, Lin, Chenghua, Plachouras, Christos, Benetos, Emmanouil, Shatri, Elona, Morreale, Fabio, Zhang, Ge, Fazekas, György, Xia, Gus, Zhang, Huan, Manco, Ilaria, Huang, Jiawen, Guinot, Julien, Lin, Liwei, Marinelli, Luca, Lam, Max W. Y., Sharma, Megha, Kong, Qiuqiang, Dannenberg, Roger B., Yuan, Ruibin, Wu, Shangda, Wu, Shih-Lun, Dai, Shuqi, Lei, Shun, Kang, Shiyin, Dixon, Simon, Chen, Wenhu, Huang, Wenhao, Du, Xingjian, Qu, Xingwei, Tan, Xu, Li, Yizhi, Tian, Zeyue, Wu, Zhiyong, Wu, Zhizheng, Ma, Ziyang, Wang, Ziyu

arXiv.org Artificial Intelligence, Sep-3-2024

In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music, spanning representation learning, generative learning and multimodal learning. We first contextualise the significance of music in various industries and trace the evolution of AI in music. By delineating the modalities targeted by foundation models, we find that many music representations are underexplored in FM development. We then emphasise the lack of versatility of previous methods across diverse music applications, along with the potential of FMs in music understanding, generation and medical application. By comprehensively exploring the details of the model pre-training paradigm, architectural choices, tokenisation, finetuning methodologies and controllability, we emphasise important topics that deserve more thorough exploration, such as instruction tuning and in-context learning, scaling laws and emergent abilities, and long-sequence modelling. A dedicated section presents insights into music agents, accompanied by a thorough analysis of datasets and evaluations essential for pre-training and downstream tasks. Finally, underscoring the vital importance of ethical considerations, we advocate that future research on FMs for music focus more on issues such as interpretability, transparency, human responsibility, and copyright. The paper offers insights into future challenges and trends on FMs for music, aiming to shape the trajectory of human-AI collaboration in the music realm.

audio-visual joint representation learning, pattern analysis and machine intelligence, tsinghua shenzhen international graduate school, (16 more...)

arXiv.org Artificial Intelligence

2408.14340

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.13)
North America > United States > California > San Francisco County > San Francisco (0.13)
(62 more...)

Genre:

Research Report > Promising Solution (1.00)
Research Report > Experimental Study (1.00)
Instructional Material (1.00)
Overview > Innovation (0.67)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)
Law > Intellectual Property & Technology Law (1.00)
(7 more...)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Speech > Acoustic Processing (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
(9 more...)


Measuring the Accuracy of Automatic Speech Recognition Solutions

Kuhn, Korbinian, Kersken, Verena, Reuter, Benedikt, Egger, Niklas, Zimmermann, Gottfried

arXiv.org Artificial Intelligence, Aug-29-2024

For d/Deaf and hard of hearing (DHH) people, captioning is an essential accessibility tool. Significant developments in artificial intelligence (AI) mean that Automatic Speech Recognition (ASR) is now a part of many popular applications. This makes creating captions easy and broadly available, but transcription needs high levels of accuracy to be accessible. Scientific publications and industry report very low error rates, claiming AI has reached human parity or even outperforms manual transcription. At the same time, the DHH community reports serious issues with the accuracy and reliability of ASR. There seems to be a mismatch between technical innovations and the real-life experience of people who depend on transcription. Independent and comprehensive data is needed to capture the state of ASR. We measured the performance of eleven common ASR services with recordings of Higher Education lectures. We evaluated the influence of technical conditions like streaming, the use of vocabularies, and differences between languages. Our results show that accuracy ranges widely between vendors and for the individual audio samples. We also measured a significantly lower quality for streaming ASR, which is used for live events. Our study shows that despite the recent improvements of ASR, common services lack reliability in accuracy.
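The abstract does not say which accuracy metric the study used. Word error rate (WER) is a common choice for ASR evaluation, and the sketch below computes it as an assumption, using a word-level edit distance with simple lowercasing and whitespace tokenisation. The example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

reference = "captioning is an essential accessibility tool"
hypothesis = "captioning is essential accessible tool"
wer = word_error_rate(reference, hypothesis)
print(round(wer, 3))
```

Here one word is dropped and one is substituted, so two errors over six reference words give a WER of about 0.333.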

accuracy, dataset, transcription, (12 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3636513

2408.16287

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.06)
North America > United States > New York > New York County > New York City (0.04)
(25 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Information Technology > Services (0.68)
Education > Educational Setting > Higher Education (0.49)
Health & Medicine > Therapeutic Area > Otolaryngology (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
